{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true,
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Tutorial: Automatic Speech Recognition with ReazonSpeech"
      ],
      "metadata": {
        "id": "gOHDZ45jDwaG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "In this tutorial, we perform Japanese speech recognition using ReazonSpeech v2.0.\n",
        "\n",
        "（Note: Choose a GPU instance in 'Runtime > Change runtime type' for acceleration)"
      ],
      "metadata": {
        "id": "39F7gDDxEEIZ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Set up ReazonSpeech\n",
        "\n",
        "First, install ReazonSpeech python package:"
      ],
      "metadata": {
        "id": "Ub-eqNS9GM-Q"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6RDMBcSnAvAW"
      },
      "outputs": [],
      "source": [
        "!apt-get install libsndfile1 ffmpeg\n",
        "!git clone https://github.com/reazon-research/reazonspeech\n",
        "!pip install --no-warn-conflicts reazonspeech/pkg/nemo-asr"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Download an audio file\n",
        "\n",
        "Next download an example audio file:"
      ],
      "metadata": {
        "id": "tqu-h6PhGTQ0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!curl -O https://research.reazon.jp/_static/demo.mp3\n",
        "\n",
        "from IPython.display import Audio, display\n",
        "display(Audio(\"demo.mp3\"))"
      ],
      "metadata": {
        "id": "QhDP0IBrGbJL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Perform speech recognition\n",
        "\n",
        "Now that the setup is ready, we can start perform Japanese speech recognition.\n",
        "\n",
        "The following Python code shows how to do it:"
      ],
      "metadata": {
        "id": "Q6wDpzqHELYE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from reazonspeech.nemo.asr import transcribe, audio_from_path, load_model\n",
        "\n",
        "# Download ReazonSpeech model from Hugging Face\n",
        "model = load_model()"
      ],
      "metadata": {
        "id": "r3eVjQumENnd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Perform speech recognition\n",
        "audio = audio_from_path(\"demo.mp3\")\n",
        "ret = transcribe(model, audio)\n",
        "\n",
        "# Output\n",
        "print(\"\\n## Result\")\n",
        "print(ret.text)"
      ],
      "metadata": {
        "id": "PIMeJ25QlT_D"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "If you can see Japanese text in the last line of the output, then it's successful."
      ],
      "metadata": {
        "id": "w-y_mhZMEXB3"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## More on speech recognition results\n",
        "\n",
        "The recognition result contains speech timestamps:"
      ],
      "metadata": {
        "id": "fIwe-YGoLh7B"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for seg in ret.segments:\n",
        "  print(\"%5.2f %5.2f %s\" % (seg.start_seconds, seg.end_seconds, seg.text))"
      ],
      "metadata": {
        "id": "7oHyyHBJLp4s"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Subword timestamps are available too:"
      ],
      "metadata": {
        "id": "DglzhjwCMefd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for word in ret.subwords[1:10]:\n",
        "  print(\"%5.2f %s\" % (word.seconds, word.token))"
      ],
      "metadata": {
        "id": "X5wpl7-YMk18"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}